Abstract

We introduce a new approach to solving path-finding problems
under uncertainty by representing them as probabilistic models
and applying domain-independent inference algorithms to the
models. This approach separates problem representation from the
inference algorithm and provides a framework for efficient
learning of path-finding policies. We evaluate the new approach
on the Canadian Traveller Problem, which we formulate as a
probabilistic model, and show how probabilistic inference allows
high-performance stochastic policies to be obtained for this
problem.


Introduction 

In planning under uncertainty the objective is to find the
optimal policy, that is, a policy that maximizes the expected
reward. In most interesting cases the optimal policy cannot be
found exactly, and approximation schemes are used to discover
the policy, either represented explicitly or as an implicit
property of the planning algorithm, through reinforcement
learning. Approximation schemes include value/policy iteration,
Q-learning, and policy gradient methods (Sutton and Barto 1998),
as well as methods based on heuristic search (Bonet and Geffner
2001) and Monte Carlo sampling such as MCTS (Kocsis and
Szepesvári 2006; Browne et al. 2012).


Domain-independent planning algorithms (Bonet and Geffner 2001;
Haslum, Bonet, and Geffner 2005; Helmert 2006) can be applied to
different domains with little modification. However, for many
applications domain-dependent techniques are still critical in
order to obtain a high-performance policy, which puts the burden
of implementation on the domain expert formulating the planning
problem.

The framework of probabilistic inference (Pearl 1988) proposes
solutions to a wide range of Artificial Intelligence problems by
representing them as probabilistic models. Efficient
domain-independent algorithms are available for several classes
of representations, in particular for graphical models
(Lauritzen 1996), where inference can be performed either
exactly or approximately. However, graphical models typically
require that the full graph of the model be represented
explicitly, and are not powerful enough for problems where the
state space is exponential in the problem size, such as the
generative models common in planning (Szörényi, Kedenburg, and
Munos 2014).

Probabilistic programs (Goodman et al. 2008; Mansinghka, Selsam,
and Perov 2014; Wood, van de Meent, and Mansinghka 2014) can
represent arbitrary probabilistic models, and efficient
approximate inference algorithms have recently emerged (Wingate,
Stuhlmüller, and Goodman 2011; Wood, van de Meent, and
Mansinghka 2014; Paige et al. 2014). In addition to expressive
power, probabilistic programming separates modeling and
inference, allowing the problem to be specified in a simple
language which does not assume any particular inference
technique.

In this paper, we show a connection between probabilistic
inference and path finding, which allows many path-finding
problems to be cast as inference problems using probabilistic
programs. Based on this connection, we provide a generic scheme
for expressing a path-finding problem as a probabilistic program
that infers the path-finding policy. We illustrate this generic
scheme by its application to the Canadian Traveller Problem
(Papadimitriou and Yannakakis 1991; Bar-Noy and Schieber 1991;
Nikolova and Karger 2008). In the empirical evaluation, we show
that high-performance stochastic policies can be obtained using
domain-independent inference techniques. In the concluding
section, we discuss other possible areas of application of
probabilistic programming in planning, as well as possible
difficulties.


Preliminaries 

Probabilistic Programming 

Probabilistic programs are regular programs extended by two
constructs (Gordon et al. 2014):

• The ability to draw random values from probability
distributions.

• The ability to condition values computed in the programs on
probability distributions.

A probabilistic program implicitly defines a probability
distribution over the program's output. Formally, we define a
probabilistic program as a stateful deterministic computation V
with the following properties:

• Initially, V expects no arguments.

• On every call, V returns either a distribution F, a
distribution and a value (G, y), a value z, or ⊥.

• Upon returning F, V expects a value x drawn from F as the
argument to continue.

• Upon returning (G, y) or z, V is invoked again without
arguments.

• Upon returning ⊥, V terminates.

A program is run by calling V repeatedly until termination.
Every run of the program implicitly produces a sequence of pairs
(F_i, x_i) of distributions and values of latent random
variables drawn from them. We call this sequence a trace and
denote it by x. A trace induces a sequence of pairs (G_j, y_j)
of distributions and values of observed random variables. We
call this sequence an image and denote it by y. We call a
sequence of values z_k an output of the program and denote it by
z. Program output is deterministic given the trace.

By definition, the probability of a trace is proportional to the
product of the probability of all random choices x and the
likelihood of all observations y:

\[
p_V(x) \propto \prod_{i=1}^{|x|} p_{F_i}(x_i) \prod_{j=1}^{|y|} p_{G_j}(y_j)
\tag{1}
\]

The objective of inference in probabilistic program V is to
discover the distribution of z.
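
The calling protocol above can be sketched in Python with a generator standing in for the stateful computation V. This is only an illustration under stated assumptions: the message tags `"sample"`, `"observe"`, and `"output"`, the `Bernoulli` class, the toy `coin_program`, and the likelihood-weighting estimator are all hypothetical names chosen here, not part of any particular probabilistic programming system.

```python
import math
import random


class Bernoulli:
    """Minimal distribution object with sample and log_prob."""

    def __init__(self, p):
        self.p = p

    def sample(self, rng):
        return 1 if rng.random() < self.p else 0

    def log_prob(self, value):
        return math.log(self.p if value == 1 else 1.0 - self.p)


def coin_program():
    """Toy program: latent coin, one noisy observation, coin as output."""
    x = yield ("sample", Bernoulli(0.5))            # returns F, expects x drawn from F
    yield ("observe", Bernoulli(0.9 if x else 0.2), 1)  # returns (G, y)
    yield ("output", x)                             # returns z
    # Falling off the end plays the role of returning the terminal value.


def run(program, rng):
    """Run the program once; return (trace, log likelihood of the image, output)."""
    gen = program()
    trace, outputs, log_weight = [], [], 0.0
    arg = None
    while True:
        try:
            msg = gen.send(arg)
        except StopIteration:
            return trace, log_weight, outputs
        arg = None
        if msg[0] == "sample":
            arg = msg[1].sample(rng)
            trace.append((msg[1], arg))
        elif msg[0] == "observe":
            log_weight += msg[1].log_prob(msg[2])
        else:
            outputs.append(msg[1])


def posterior_of_output(program, n, seed=0):
    """Likelihood weighting: sample latents from their priors, weight by the image."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(n):
        _, log_weight, outputs = run(program, rng)
        key = tuple(outputs)
        totals[key] = totals.get(key, 0.0) + math.exp(log_weight)
    norm = sum(totals.values())
    return {k: v / norm for k, v in totals.items()}
```

For this toy program the exact posterior probability of the output 1 is 0.45 / 0.55, so `posterior_of_output(coin_program, 20000)` should return a value close to 0.82 for the key `(1,)`.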

Several implementations of general probabilistic programming
languages are available (Goodman et al. 2008; Mansinghka,
Selsam, and Perov 2014; Wood, van de Meent, and Mansinghka
2014). Inference is usually performed using Monte Carlo sampling
algorithms for probabilistic programs (Wingate, Stuhlmüller, and
Goodman 2011; Wood, van de Meent, and Mansinghka 2014; Paige et
al. 2014). While some algorithms are better suited for certain
inference types, most can be used with any valid probabilistic
program.


Canadian Traveller Problem 

The Canadian Traveller Problem (CTP) was introduced in
(Papadimitriou and Yannakakis 1991) as a problem of finding the
shortest travel distance in a graph where some edges may be
blocked. There are several variants of CTP (Bar-Noy and Schieber
1991; Nikolova and Karger 2008; Bnaya, Felner, and Shimony
2009); here we consider the stochastic version. In the
stochastic CTP we are given

• An undirected weighted graph G = (U, E).

• The initial and the final location nodes s and t.

• Edge weights w : E → R.

• Traversability probabilities p : E → (0, 1].

The actual state of each edge is fixed for every problem
instance but becomes known only upon reaching one of the edge
vertices. The goal is to find a policy that minimizes the
expected travel distance from s to t. The travel distance is the
sum of weights of all traversed edges during the travel, where
traversing in each direction is counted separately.



Figure 1: A path in the graph of a probabilistic program 


The CTP is PSPACE-hard (Fried et al. 2013); however, a number of
heuristic algorithms were proposed, including high-quality
policies based on Monte Carlo methods (Eyerich, Keller, and
Helmert 2010). Policies are empirically compared by averaging
the travel distance over multiple instantiations of the actual
states of the edges (open or blocked) according to the
traversability probabilities. Since the travel distance is
defined only for instances where a path between s and t exists,
instantiations in which t cannot be reached from s are ignored.

A trivial travel policy is realized by traversing the problem
graph in a depth-first order until the final location is
reached. The expected travel distance of the policy is bounded
from above by the sum of weights of all edges in the graph, as
can be seen by noticing that every edge is traversed at most
once in each direction, and at most half of the edges are
traversed on average.
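
The trivial depth-first policy can be sketched as follows. This is a minimal illustration, assuming a hypothetical four-node graph (`ADJ`, `OPEN`) and a representation of open edges as a set of `frozenset` pairs; neither comes from the paper. Backtracking out of a dead end adds the edge weight a second time, matching the convention that each direction of traversal is counted separately.

```python
# Hypothetical example graph: node 0 is s, node 3 is t.
# ADJ maps a node to its (neighbor, weight) pairs.
ADJ = {
    0: [(1, 1.0), (2, 2.0)],
    1: [(0, 1.0), (3, 1.0)],
    2: [(0, 2.0), (3, 1.5)],
    3: [(1, 1.0), (2, 1.5)],
}
# One instantiation of edge states: edge 1-3 is blocked.
OPEN = {frozenset((0, 1)), frozenset((0, 2)), frozenset((2, 3))}


def dfs_travel(adj, open_edges, s, t):
    """Walk depth-first from s until t is reached; return (reached, distance)."""
    visited = {s}
    distance = 0.0

    def visit(u):
        nonlocal distance
        if u == t:
            return True
        for v, w in adj[u]:
            # The state of an edge is known once the agent stands at u.
            if v in visited or frozenset((u, v)) not in open_edges:
                continue
            visited.add(v)
            distance += w      # walk forward along (u, v)
            if visit(v):
                return True
            distance += w      # dead end: walk back along (v, u)
        return False

    return visit(s), distance


reached, distance = dfs_travel(ADJ, OPEN, 0, 3)
```

On this instance the walk goes 0, 1, back to 0, then 2, 3. Since every edge is crossed at most once in each direction, the distance on any single instance never exceeds twice the total edge weight.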


Duality between Path Finding and 
Probabilistic Inference 

We shall now show a connection between path finding and
probabilistic inference. This connection was noticed earlier
(Shimony and Charniak 1991) and was used to search for the
maximum a-posteriori probability (MAP) assignment in graphical
models using a best-first search algorithm. Here we further
extend the analogy and establish a bilateral correspondence
between inferring the distribution defined by a probabilistic
model and learning the optimal policy in a path-finding problem.

We proceed in two steps. First, following earlier work, 
we establish a connection between a MAP assignment and 
the shortest path. Then, based on this analogy, we explain 
how discovering the optimal policy in a generative model 
can be translated into inferring the output distribution of a 
probabilistic program and vice versa. 

Inference on probabilistic programs computes a representation of
distribution (1). An equivalent form of (1) is obtained by
taking the logarithm of both sides:

\[
\log p_V(x) = \sum_{i=1}^{|x|} \log p_{F_i}(x_i) + \sum_{j=1}^{|y|} \log p_{G_j}(y_j) + C
\tag{2}
\]

where C is a constant that does not depend on x. To find the MAP
assignment x_MAP, one must maximize log p_V(x). One can view x
as a specification of a path in a graph where each node
corresponds to either (F_i, x_i) or (G_j, y_j), and the costs of
edges entering (F_i, x_i) or (G_j, y_j) are −log p_{F_i}(x_i) or
−log p_{G_j}(y_j), correspondingly (Figure 1). Then, finding the
MAP assignment is tantamount to finding the trace x that
produces the shortest path. Shimony and Charniak (1991) and Sun,
Druzdzel, and Yuan (2007) use this correspondence in their MAP
algorithms for graphical models.
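
The correspondence between the MAP trace and the shortest path can be checked on a small discrete example. The sketch below assumes a hypothetical toy model with two latent choices and one observation (the tables `PRIOR_A`, `PRIOR_B`, and `LIKELIHOOD` are invented for illustration). It runs Dijkstra's algorithm over the graph of partial traces with edge costs equal to negated log probabilities, and compares the result with brute-force maximization of the trace probability.

```python
import heapq
import math
from itertools import product

# Hypothetical toy model: two latent choices and one observation.
PRIOR_A = {"a0": 0.7, "a1": 0.3}                  # p_F1(x1)
PRIOR_B = {"a0": {"b0": 0.6, "b1": 0.4},          # p_F2(x2), depends on x1
           "a1": {"b0": 0.1, "b1": 0.9}}
LIKELIHOOD = {"b0": 0.2, "b1": 0.8}               # p_G1(y1), depends on x2


def successors(node):
    """Edges leaving a partial trace, each with cost -log p."""
    if len(node) == 0:       # choose x1
        return [((a,), -math.log(p)) for a, p in PRIOR_A.items()]
    if len(node) == 1:       # choose x2 given x1
        return [(node + (b,), -math.log(p))
                for b, p in PRIOR_B[node[0]].items()]
    if len(node) == 2:       # edge entering the observation node
        return [(node + ("obs",), -math.log(LIKELIHOOD[node[1]]))]
    return []


def shortest_trace():
    """Dijkstra over the graph of partial traces; returns (cost, trace)."""
    queue = [(0.0, ())]
    while queue:
        cost, node = heapq.heappop(queue)
        if len(node) == 3:                 # complete trace including the observation
            return cost, node[:2]
        for nxt, edge_cost in successors(node):
            heapq.heappush(queue, (cost + edge_cost, nxt))
    raise ValueError("no complete trace")


def map_by_enumeration():
    """Brute-force MAP: maximize the unnormalized trace probability."""
    def prob(x):
        return PRIOR_A[x[0]] * PRIOR_B[x[0]][x[1]] * LIKELIHOOD[x[1]]
    return max(product(PRIOR_A, LIKELIHOOD), key=prob)


path_cost, path_trace = shortest_trace()
```

Because the graph of partial traces is a tree here, the first complete trace popped from the queue is the cheapest one, and exponentiating its negated cost recovers the unnormalized probability of the MAP trace.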

We shall turn now to a more general case when the MAP assignment
of a part of the trace x^θ is inferred. In a probabilistic
program, this is expressed by selecting x^θ as the program
output, z ≡ x^θ. The distribution of z is marginalized over the
rest of the trace x^{\θ} = x \ x^θ, and finding the MAP
assignment for x^θ corresponds to finding the mode of the output
distribution:

\[
x^{\theta}_{MAP} = \arg\max_{z} p_V(z)
= \arg\max_{x^{\theta}} \int \prod_{i=1}^{|x|} p_{F_i}(x_i) \prod_{j=1}^{|y|} p_{G_j}(y_j) \, dx^{\setminus\theta}
\tag{3}
\]

where the integrand in equation (3) depends on x^{\θ}. Just like
in the case of MAP assignment to all random variables, equation
(3) corresponds to a path-finding problem: x^θ can be viewed as
a policy, and determining x^θ_MAP corresponds to learning a
policy which minimizes the expected path length.
While in principle policy learning algorithms could be used for
MAP estimation, a greater potential lies, in our opinion, in
casting planning problems as probabilistic programs and learning
the optimal policies by estimating the modes of the programs'
distributions. We suggest adopting the Bayesian approach,
according to which prior beliefs are imposed on policy
parameters, and the optimal policy is learned through inferring
posterior beliefs by conditioning the beliefs on observations.
We explore this approach in the next section.


Stochastic Policy Learning through
Probabilistic Inference

We have shown that in order to infer the optimum policy, a
probabilistic program for policy learning should run the agent
on the distribution of problem instances and policies, and
compute the probability of each execution such that the
logarithm of the probability is equal to the negated travel
cost. The generic program shown in Algorithm 1 achieves this by
randomly drawing problem instances and policies from their
distributions supplied as program arguments (lines 1 and 2), and
updating the log probability of the sample (line 4) by calling
Observe. Observe adds the log probability of its first argument,
the value, with respect to its second argument, the
distribution.

Algorithm 1 Policy learning through probabilistic inference

Require: agent, Instances, Policies
1: instance ← Draw(Instances)
2: policy ← Draw(Policies)
3: cost ← Run(agent, instance, policy)
4: Observe(1, Bernoulli(e^{−cost}))
5: Print(policy)

Consequently, the log probability of the output policy is

\[
\begin{aligned}
\log p_V(\mathit{policy}) &= \log p_{\mathit{Policies}}(\mathit{policy}) + \log p_{\mathrm{Bernoulli}(e^{-\mathit{cost}(\mathit{policy})})}(1) + C \\
&= -\mathit{cost}(\mathit{policy}) + \log p_{\mathit{Policies}}(\mathit{policy}) + C
\end{aligned}
\tag{5}
\]

When policies are drawn from their distribution uniformly,
log p_Policies(policy) is the same for any policy, and does not
affect the distribution of policies specified by the
probabilistic program:

\[
\log p_V(\mathit{policy}) = -\mathit{cost}(\mathit{policy}) + C'
\tag{6}
\]
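
A minimal Python sketch of the generic program follows, assuming a hypothetical two-policy toy problem (a `shortcut` that is open with probability 0.8 versus a fixed-cost `detour`). The inference here is plain likelihood weighting, that is, sampling instances and policies from their priors and weighting each run by e^{−cost}; it is used only because it fits in a few lines and is not the inference algorithm used in the experiments.

```python
import math
import random
from collections import defaultdict


def draw_instance(rng):
    """Hypothetical instance distribution: the shortcut is open w.p. 0.8."""
    return {"shortcut_open": rng.random() < 0.8}


def draw_policy(rng):
    """Uniform prior over two hypothetical policies."""
    return rng.choice(["shortcut", "detour"])


def run_agent(instance, policy):
    """Travel cost of following the policy on the instance."""
    if policy == "detour":
        return 3.0
    # Taking the shortcut costs 1 if it is open, 5 if the agent must turn back.
    return 1.0 if instance["shortcut_open"] else 5.0


def policy_posterior(n_samples, seed=0):
    """Estimate the output distribution of the program by likelihood weighting."""
    rng = random.Random(seed)
    weight = defaultdict(float)
    for _ in range(n_samples):
        instance = draw_instance(rng)         # line 1: draw an instance
        policy = draw_policy(rng)             # line 2: draw a policy
        cost = run_agent(instance, policy)    # line 3: run the agent
        weight[policy] += math.exp(-cost)     # line 4: Observe(1, Bernoulli(e^-cost))
    total = sum(weight.values())
    return {p: w / total for p, w in weight.items()}   # distribution of line 5 output


posterior = policy_posterior(20000)
best_policy = max(posterior, key=posterior.get)
```

With these numbers the shortcut policy has both the lower expected cost (1.8 versus 3) and the higher posterior mass, so the estimated mode selects it.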

In practice, this is achieved by using a uniform distribution on
policy parameters, such as the uniform continuous or discrete
distribution for scalars, the categorical distribution with
equal choice probabilities for discrete choices, or the
symmetric Dirichlet distribution with parameter 1 for real
vectors. Alternatively, if different policies have different
probabilities with respect to the distribution Policies from
which the policies are drawn, their log probabilities (taken
with the opposite sign) have the interpretation of the costs of
the corresponding policies and provide a means for specifying
preferences of the model designer with respect to different
policies. In either case, the optimal policy is approximated by
estimating the mode of the program output.

When policies are drawn uniformly, the scale of the travel cost
does not affect the choice of optimal policy. However, as
follows from equation (6), the shape of the probability density
(or probability mass for discrete distributions) depends on the
cost scale: the higher the cost, the sharper the shape. Thus, by
altering the cost scale we can affect the performance of the
inference algorithm. On one hand, the mode estimate of a sharper
function can be computed with higher accuracy; on the other
hand, when p_V(policy) changes too fast with its argument in the
high probability region, approximate inference algorithms
converge slowly. The right scale depends on the probabilistic
program, and finding the most appropriate scale is a parameter
optimization problem.
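
The effect of the cost scale can be seen on a small numeric example. The sketch below assumes three hypothetical policies with fixed costs and a uniform prior, so that the probability of a policy is proportional to exp(−scale · cost); the `scale` parameter is an illustrative name for the multiplier applied to the cost.

```python
import math


def policy_distribution(costs, scale):
    """p(policy) proportional to exp(-scale * cost) under a uniform prior."""
    logits = [-scale * c for c in costs]
    top = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


costs = [2.0, 2.5, 4.0]                     # hypothetical travel costs of three policies
flat = policy_distribution(costs, scale=0.1)
sharp = policy_distribution(costs, scale=10.0)
```

Both distributions have their mode at the first policy, but with `scale=0.1` the three probabilities are nearly equal, while with `scale=10.0` almost all of the mass sits on the mode.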

Note that the probabilistic program for policy learning is
independent of the inference algorithm which would be used to
obtain the results. The programmer does not need to make any
assumptions about the way the mode of the output distribution is
estimated, and even whether approximate or exact inference (if
appropriate) is performed.

Case Study: Canadian Traveller Problem 

We evaluated the proposed policy learning scheme on the Canadian
Traveller Problem (Algorithm 2). The algorithm draws CTP problem
instances from a given graph with traversability of each edge
randomly selected according to the probabilities p, and learns a
stochastic policy based on depth-first search: the policy is
specified by a vector of probabilities of selecting each of the
adjacent edges in every node. When the policy is realized, the
selection probabilities are conditioned such that only open
unexplored edges are selected, in accordance with the base
depth-first search traversal. Dirichlet(1^{deg(u)}) is a uniform
distribution, hence the log probability of a trace is equal to
the path cost taken with the opposite sign. StDFS (line 6) is a
flavour of depth-first search which enumerates node children in
a random order according to the policy for the current node. An
optimal policy is expected to assign a higher probability to
edges leading to shorter paths that have a lower probability of
being blocked.

Algorithm 2 Learning stochastic policy for the Canadian
Traveller Problem

Require: CTP(G, s, t, w, p)
1: instance ← Draw(CTP(G, w, p))
2: for u ∈ U do
3:   policy(u) ← Draw(Dirichlet(1^{deg(u)}))
4: end for
5: repeat
6:   (reached, distance) ← StDFS(instance, policy)
7: until reached
8: Observe(1, Bernoulli(e^{−distance}))
9: Print(policy)
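
One possible reading of the stochastic depth-first search is sketched below; it is an assumption about the details, not the implementation evaluated here. At each node the next edge is drawn with probability proportional to the policy weights, restricted to open unexplored edges, and backtracking is charged as a second traversal. The graph `ADJ`, the edge states `OPEN`, and the function names are hypothetical.

```python
import random

# Hypothetical example graph: node 0 is s, node 3 is t; edge 1-3 is blocked.
ADJ = {
    0: [(1, 1.0), (2, 2.0)],
    1: [(0, 1.0), (3, 1.0)],
    2: [(0, 2.0), (3, 1.5)],
    3: [(1, 1.0), (2, 1.5)],
}
OPEN = {frozenset((0, 1)), frozenset((0, 2)), frozenset((2, 3))}


def draw_uniform_policy(adj, rng):
    """Draw policy(u) from Dirichlet(1, ..., 1) by normalizing Gamma(1, 1) draws."""
    policy = {}
    for u, neighbors in adj.items():
        g = [rng.gammavariate(1.0, 1.0) for _ in neighbors]
        total = sum(g)
        policy[u] = [x / total for x in g]
    return policy


def stdfs(adj, open_edges, policy, s, t, rng):
    """Stochastic DFS from s to t; returns (reached, distance)."""
    visited = {s}
    distance = 0.0

    def visit(u):
        nonlocal distance
        if u == t:
            return True
        while True:
            # Condition the policy on open, unexplored edges only.
            cand = [(v, w, q) for (v, w), q in zip(adj[u], policy[u])
                    if v not in visited and frozenset((u, v)) in open_edges]
            if not cand:
                return False
            v, w, _ = rng.choices(cand, weights=[q for _, _, q in cand])[0]
            visited.add(v)
            distance += w          # move forward along (u, v)
            if visit(v):
                return True
            distance += w          # dead end: move back along (v, u)

    return visit(s), distance


rng = random.Random(0)
policy = draw_uniform_policy(ADJ, rng)
reached, distance = stdfs(ADJ, OPEN, policy, 0, 3, rng)
```

On this small instance the search either goes straight through node 2 (distance 3.5) or first tries the dead end at node 1 (distance 5.5), so a policy that puts more weight on the edge towards node 2 at the start lowers the expected distance.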

To assess the quality of learned policies we generated several
CTP problem specifications by triangulating a randomly drawn set
of either 50 or 20 nodes from Poisson-distributed points on a
unit square. The average DFS travel cost in fully traversable
instances was 7.9 for 50 node instances, and 5.7 for 20 node
instances. The same traversal probability in the range
[0.35, 1.0] is assigned to every edge of the graph (the bond
percolation threshold for Delaunay triangulation is ≈0.33
(Becker and Ziff 2009), hence instances with p < 0.3 are
disconnected with high probability). A 50 node instance is shown
in Figure 2. The s and t nodes are marked by the red circles,
and edge weights are equal to the Euclidean distances between
the nodes.

Lightweight Metropolis-Hastings (Wingate, Stuhlmüller, and
Goodman 2011) was used for inference. We learned a policy for
each problem specification by running the inference algorithm
for 10,000 iterations. Then, we evaluated policies returned at
different numbers of iterations on 1,000 randomly drawn
instances to estimate the average travel cost. The average
computation time of learning and evaluation per instance was
≈80s on an Intel Core i5 CPU.

The results are shown in Figure 3, where the solid lines
correspond to the average travel cost over the set of problems
of the corresponding size, and dashed lines to 95% confidence
intervals. For both 50 and 20 node problems, the policy mostly
converged after ≈1000 iterations, achieving 50-80% improvement
compared to the uniform stochastic policy. While a further
refinement of the policy is possible, a different type of policy
should be learned to obtain significantly better results, for
example, a deterministic policy which takes online information
into account. This, however, would complicate the probabilistic
program, which we chose to keep as simple as possible: the
actual implementation of the program is just above 100 lines of
code, including the implementation of DFS.

Figure 2: An instance of CTP with 50 nodes. Initial (1) and
final (25) locations are marked by red circles; edge weights
are Euclidean distances between edge vertices.

Figure 3: Average travel cost vs. number of samples for
problems with 50 and 20 nodes and traversability probabilities
0.85 and 0.5. The policies mostly converged after ≈1000
samples.

A learned policy for a 50 node problem is visualized in
Figure 4. Edge widths correspond to the confidence about the
policy for the edge. Edges with higher precision (lower
variance) of the policy are broader. Edge color is blue when a
traversal through the edge is much more probable in one
direction than in the other, and green when traversal in either
direction has the same probability, with shades of green and
blue reflecting how directed the edge is. As we would expect in
a converged policy, edges in the center of the graph are
thicker, that is, more explored, than at the periphery, where
changes in the policy are less likely to affect the average
travel cost. Bright blue (uni-directional) edges are mostly
radial relative to the direction from the initial position
(node 1) to the goal (node 25), and many well-explored
tangential edges are green (bi-directional). This corresponds to
an intuition about the policy: traversals through radial edges
are mostly in the direction of the goal, and through the
tangential edges in either direction to find an alternative
route when the edge leading to the goal is blocked.

Figure 4: Visualization of policy learned for blocking
probability 0.5 on instance in Figure 2. Broader edges
correspond to more explored components of the policy.


Discussion

We introduced a new approach to policy learning based on casting
a policy learning task as a probabilistic program. The main
contributions of the paper are:

• Discovery of a bilateral correspondence between probabilistic
inference and policy learning for path finding.

• A new approach to policy learning based on the established
correspondence.

• A realization of the approach for the Canadian Traveller
Problem, where improved policies were consistently learned by
probabilistic program inference.

The proposed approach can be extended to many different planning
problems, both in well-known path-finding applications and in
other domains involving policy learning under uncertainty;
partially observable Markov decision processes and generalized
multi-armed bandit settings are just two examples. At the same
time, the exposure of probabilistic programming tools to
different domains and new applications is challenging. These
tools were initially developed with certain applications in
mind. Our limited experience shows that the probabilistic
programming paradigm scales well to new applications and larger
problems. However, as more problems are approached using the
probabilistic programming methodology, apparent weaknesses and
limitations are uncovered, and a more powerful and flexible
inference algorithm will have to be developed.

The policy learning algorithm presented here follows the offline
learning scheme: the policy is selected before acting, and then
used unmodified until the goal is reached. Although this is,
indeed, the easiest way to cast policy learning as probabilistic
inference, online learning can also be implemented so that when
additional computation during acting is justified by the time
cost, the policy is updated based on the information gathered
online, as in some of the state-of-the-art algorithms for CTP
(Eyerich, Keller, and Helmert 2010). Moreover, the time cost of
updating the policy incrementally based on the new evidence is
lower than that of inferring a new policy, due to the any-time
nature of Bayesian updating. Online inference is a subject of
ongoing research in probabilistic programming.

By performing inference on a probabilistic program, we obtain a
representation of the distribution of policies rather than a
single policy. We then use this distribution to select a policy.
When the inference is performed approximately, which is a common
case, the expected quality of the selected policy improves with
more computation. In the most basic setting, a fixed threshold
on the number of iterations of the inference algorithm can be
imposed. In general, however, determining when to stop the
inference and commit to a particular policy, whether in the
offline or the online setting, is a rational metareasoning
decision (Russell and Wefald 1991; Hay et al. 2012). Making this
decision in an informed and systematic way is another topic for
research.